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1. INTRODUCTION 

The communication of an algorithm (e.g., transferring data between the CPU and 
memory devices, or between parallel processors, a.k.a. I/O-complexity) often costs sig- 
nificantly more time than its arithmetic. It is therefore of interest (1) to obtain lower 
bounds for the communication needed, and (2) to design and implement algorithms 
attaining these lower bounds. Communication also requires much more energy than 
arithmetic, and saving energy may be even more important than saving time. 
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Communication time varies by orders of magnitude, from O (10 9 ) second for an LI 
cache reference, to O (10 -2 ) second for disk access. The variation can be even more 
dramatic when communication occurs over networks or the internet. While Moore's 
Law predicts an exponential increase of hardware density in general, the annual im- 
provement rate of time-per-arithmetic- operation has, over the years, consistently ex - 
ceeded that of time-per-word read/write HGraham et al. 2004[|Fuller and Millett 201111 . 
The fraction of running time spent on communication is thus expected to increase fur- 
ther. 



1.1. Communication model 

We model communication costs of sequential and parallel architecture as follows. In 
the sequential case, with two levels of memory hierarchy (fast and slow), communi- 
cation means reading data items (words) from slow memory (of unbounded size), to 
fast memory (of size M) and writing data from fast memory to slow memorjQ. Words 
that are stored contiguously in slow memory can be read or written in a bundle which 
we will call a message. We assume that a message of n words can be communicated 
between fast and slow memory in time a + fin where a is the latency (seconds per mes- 
sage) and fi is the inverse bandwidth (seconds per word). We define the bandwidth cost 
of an algorithm to be the total number of words communicated and the latency cost of 
an algorithm to be the total number of messages communicated. We assume that the 
input matrices initially reside in slow memory, and are too large to fit in the smaller 
fast memory. Our goal then is to minimize both bandwidth and latency costsH 

In the parallel case, we assume p processors, each with memory of size M (or with 
larger memory size, as long as we never use more than M in each processor). We are 
interested in the communication among the processors. As in the sequential case, we 
assume that a message of n consecutively stored words can be communicated in time 
a + fin. This cost includes the time required to "pack" non-contiguous words into a 
single message, if necessary. We assume that the input is initially evenly distributed 
among all processors, so M ■ p is at least as large as the input. Again, the bandwidth 
cost and latency cost are the word and message counts respectively. However, we count 
th e number of words and messages communicated along the critical path as defined 
in | Yang and Miller 1988] (i.e., two words that are communicated simultaneously are 



counted only once), as this metric is closely related to the total running time of the 
algorithm. As before, our goal is to minimize the number of words and messages com- 
municated. 

We assume that (1) the cost per flop is the same on each processor and the commu- 
nication costs (a and fi) are the same between each pair of processors (this as sumption 
is fo r ea se of presentation and can be dropped, using [Ballard et al. 20 lH ; see Sec- 
tion I6.3D . (2) all communication is "blocking": a processor can send/receive a single 
message at a time, and cannot communicate and compute a flop simultaneously (the 
latter assumption can be dropped, affecting the running time by a factor of two at 
most), and (3) there is no communication resource contention among processors. For 
example, if processor sends a message of size n to processor 1 at time 0, and proces- 



x See IBallard et a l. 20101 for definition of a model with memory hierarchy, and a reduction from the two-level 
model. All bounds in this paper thus apply to the model with memory hierarchy as well. 
2 The sequential communication model used here is sometimes called t he two-level I/O 
model or disk access machine (D AM) model (see lAggarwal and Vitter 1988; Bende ret al. 2007] 
IChowdhury and R amachandran 2006 1). Our bandwidth cost model follows that of |Hong and Kung 1981) 
and I Irony et al. 2004] in that it assumes the block-transfer size is one word of data (B = 1 in the common 
notation). However, our model allows message sizes to vary from one word up to the maximum number of 
words that can fit in fast memory. 
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sor 2 sends a message of size n to processor 3 also at time 0, the cost along the critical 
path is a + [3n. However, if both processor and processor 1 try to send a message to 
processor 2 at the same time, the cost along the critical path will be the sum of the 
costs of each message. 

1.2. The Computation Graph and Implementations of an Algorithm 

The computation performed by an algorithm on a given input can be modeled (see 
Section[3) as a computation directed acyclic graph (CDAG) : We have a vertex for each 
input / intermediate / output argument, and edges according to direct dependencies 
(e.g., for the binary arithmetic operation x := y + z we have a directed edge from v y 
to v x and from v z to v x , where the vertices v xi v y ,v z stand for the arguments x,y,z, 
respectively). 

An implementation of an algorithm determines, in the parallel model, which arith- 
metic operations are performed by which of the p processors. This corresponds to parti- 
tioning the corresponding CDAG into p parts. Edges crossing between the various parts 
correspond to arguments that are in the possession of one processor, but are needed by 
another processor, therefore relate to communication. In the sequential model, an im- 
plementation determines the order of the arithmetic operations, in a way that respects 
the partial ordering of the CDAG (see Section [3] relating this to communication cost). 

Implementations of an algorithm may vary greatly in their communication costs. 
The 1 1 0-complexity of an algorithm is the minimum bandwidth cost of the algorithm, 
over all possible implementations. The I/O-complexityof a problem is denned to be 
the minimum I/O-complexityof all algorithms for this problem. A lower bound of the 
I/O-complexityof an algorithm is therefore a results of the form: any implementation of 
algorithm Alg requires at least X communication. An upper bound is of the form: there 
is an implementation for algorithm Alg that requires at most X communication. We 
detail below some of the I/O-complexity lower and upper bounds of specific algorithms, 
or a class of algorithms. I/O-complexity lower bounds for a problem are claims of the 
form: any algorithm for a problem P requires at least X communication. These are 
much harder to find (but see for example [Demmel, Grigori, Hoemmen, and Langou, 
2008]). 

The lower bounds in this paper are for all implementations for a family of algorithms: 
"Strassen-like" fast matrix multiplication. Generally speaking, a "Strassen-like" algo- 
rithm utilizes an algorithm for multiplying two constant-size matrices in order to re- 
cursively multiply matrices of arbitrary size; see Section [5] for precise definition and 
technical assumptions. 

1.3. Previous Work 

Consider the classical Q(n 3 ) algorithm for matrix multiplication^]. While naive im- 
plementations are communication inefficient, communication-minimizing sequen- 
tial and parallel variants of this algorithm were constructed, and proved optimal , 
by matching low er bounds ICa nnon 1969[ Hong and Kung 1981 Frigo et al. 1999 



Irony et al . 20041 . 

In IIBallard et al. 2010: Ballard et al. 2011dl we generalize the results of 



I Hong and Kung 1981| |Irony et al. 2004J regarding matrix multiplication, to ob- 
tain new 1/O-compIexity lower bounds for a much wider variety of algorithms. Most 
of our bounds are shown to be tight. This includes all "classical" algorithms for LU 
factorization, Cholesky factorization, LDL T factorization, and many for the QR factor- 
ization, and eigenvalues and singular values algorithms. Thus we essentially cover all 



3 By which we mean any algorithm that computes using the n 3 multiplications, whether this is done recur- 
sively, iteratively, block-wise or any other way. 
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direct methods of linear algebra. The results hold for dense matrix algorithms (most 
of them have 0(n 3 ) complexity), as well as sparse matrix algorithms (whose running 
time depends on the number of non-zero elements, and their locations). They apply to 
sequential and parallel algorithms, to compositions of linear algebra operations (like 
computing the powers of a matrix), and to certain graph-theoretic problems^. 

The optimal algorithms for square matrix multiplication are well known. Optimal 
algorithms for dense L U, Cholesky, QR, eigenvalue problems and the SVD are more 
recent. These include IGustavson 19971 fToledo 19971 lElmroth and Gustavson 19981 



Frigo et al. T9991 |Ahmed and Pingali 2000} IFrens and Wise 20031 [Demmel, 
Grigori, Hoemmen , and Langou, 20 08 ]; [Demmel, Grigor i, and Xiang, 2008] ; 



IBallard et al. 20091 IDavid et al. 2010[ IDemmel et al. 20101 iBallard et al. 2011H, 



and are not part of standard libraries like LAPACK [Anderson et al. 1992 1 and 



ScaLAPACK [Blackford et al. 19971 See llBallard et al. 20lTcll for more details. 



In IBallar d et al. 20101 IBallard et al. 2011cB we use the approach of 



I Irony et a l. 2004 ], based on the Loomis-Whit ney geometric theorem 
I Loomis and Whitney 1949) |Burago and Zalgaller 1988| , by embedding segments 



of the computation process into a three-dimensional cube. This a pproach, howev er, is 
not suitable when distributivity is used, as is the case in Strassen ifStrassen 19691 and 



other fast matrix multiplication algorithms (e.g., | Coppersmith and Winograd 1990 
ICohn et al. 20051 ). 

While the I/O-complexity of classic matrix multiplication and algorithms with sim- 
ilar structure is quite well understood, this is not the case for algorithms of more 
complex structure. The problem of minimizing com munication in parallel classical 
matrix multiplication was addressed [Cannon 196911 almost simul taneously with the 
publication of Strassen's fast matrix multiplication MStrassen 196911 . Moreover, an I/O- 
complexity lower bound for the classical matrix multiplication algorithm has been 



known for three decades [Hong and Kung 1981). Nevertheless, the I/O-complexity of 



Strassen's fast matrix multiplication and similar algorithms has not been resolved. 

In this paper we obtain first communication cost lower bounds for Strassen's and 
other fast matrix multiplication algorithms, in the sequential and parallel models. 
These bounds are attainable both for sequential and for parallel algorithms and so 
optimal. 

1.4. Communication Costs of Fast Matrix Multiplication 

1.4.1. Upper bound. The I/O-complexity IO (n) of Strassen's algorithm (see Algorithm 
[H Appendix lAl. applied to n-by-n matrices on a machine with fast memory of size 
M, can be bounded above as follows (for actual uses of Strassen's algo rithm, see 
[Douglas etal. 1994[ IHuss-Lederman et al. 19961 |Desprez and Suter 2004| ): Run the 
recursion until the matrices are sufficiently small. Then, read the two input sub- 
matrices into the fast memory, perform the matrix multiplication inside the fast mem- 
ory, and write the result into the slow memorjQ. We thus have IO(n) < 7-IO (^) +0(n 2 ) 

and IO (^p) = 0{M). Thus 



IO(n) = 01 —;= ) -Ml. (1) 




4 See [Michael et al. 20021 for bounds on graph-related problems, and our I Ballard et al. 2011c I for a detailed 
list of previously known and recently designed sequential and parallel algorithms that attain the above 
mentioned lower bounds. 

5 Here we assume that the recursion tree is traversed in the usual depth-first order. 
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1.4.2. Lower bound. In this paper, we obtain a tight lower bound: 

Theorem 1.1. (Main Theorem) The I / ' O -complexity IO(n) of Strassen's algo- 
rithm on a machine with fast memory of size M, assuming that no intermediate values 
are computed twic^ is 



IO{n) = nil -= ) •Ml. (2) 




It holds for any implementation and any known variant of Strassen's algorithrrQB 
This includes Winograd's 0(?i lg7 ) variant that uses 15 additions instead of 18, which 



is the most used fast matrix multiplication algorithm in practice (Douglas et al. 1994} 
IHuss-Lederman et al. 19961 [Desprez and Suter 2004 1. 



For pa rallel algorithms, using a reduction from the s equential to the parallel model 



(see e.g., | Irony et al. 2004[ or our HBallard et al. 20lTcTl ) this yields: 




COROLLARY 1.2. Let IO(n) be the 1 1 ' O -complexity of Strassen's algorithm, run on a 
machine with p processors, each with a local memory of size M. Assume that no inter- 
mediate values are computed twice. Then 

IO(n) = 

We can extend these bounds to a wider class of all "Strassen-like" fast matrix mul- 
tiplication algorithms. Note that this class does not include all fast matrix multipli- 
cation algorithms (see Section [5TT1 for definitio n of "S trassen-like" algorithms, and in 
particular the technical assumption in Section 15. 1. ID . Let Alg be any "Strassen-like" 
matrix multiplication algorithm that runs in time O(n uo ) for some 2 < uj < 3. Then, 
using the same arguments that lead to Q}, the I/O-complexityof Alg can be shown to 



be IO(n) = O H-^L*) ■ Mj . We obtain a matching lower bound: 

THEOREM 1.3. The 1 I ' O -complexity IO(n) of a recursive "Strassen-like" fast matrix 
multiplication algorithm with 0(n"°) arithmetic operations, on a machine with fast 
memory of size M is 



IO(n) = Q ( ( -j= ) • M ) . (3) 



Note that or the cubic recursive algorithm for matrix multiplication, cj = lg 8 = 3, 
and the above formula is IO{n) = fl ( (^jf) • M J = O ( -j=) and identifies with the 



lower bounds of I Hong and Kung 19811 and | Irony et al. 2004 1. While the lower bounds 
for wo = 3 and for o>o < 3 have the same form, the proofs are completely different, and 
it is not clear whether our approach can be used to prove their lower bounds and vice 
versa. 

COROLLARY 1.4. Let IO(n) be the II O -com plexity of a "Strassen-like" algorithm 
( with arithmetic performed as in Theorem \1.3h run on a machine with p processors, 



6 We assume no recomputation throughout the paper. 

7 This lower bound for the sequen tial case seems to contradi ct the upper b ound from FOCS'99 
IFrigo et al. 1999, Blelloch et al. 2008)), due to a miscalculation (see fLeiserson 2008 1 for details). 
s To obtain the lower bounds for latency costs we divide the bandwidth costs by the maximal message length, 
Al. This holds for all the lower bounds here, both in the sequential and parallel models. 
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each with a local memory of size M. Assume that no intermediate values are computed 
twice. Then 



i U0 M 

IO(n) = Q 



M J P 



1.5. The Expansion Approach 

The proof of the main theorem is based on estimating the edge expansion of the compu- 
tation directed acyclic graph (CD AG) of an algorithm. The I/O-complexityis shown to 
be closely related to the edge expansion properties of this graph. As the graph has a re- 
cursive structur e, the expansion can be analyzed directly (combina torially, similarly to 
what is done in IIMihail 1989 l lAlon et al. 2008t|Koucky et al. 2010)) or by spectral ana l- 



ysis (in the spirit of what was done for the Zig-Zag expanders I Reingold et al. 2002[ ). 
There is, however, a new technical challenge. The replacement product and the Zig-Zag 
product act similarly on all vertices. This is not what happens in our case: multiplica- 
tion and addition vertices behave differently. 

The expansion app roach is similar to the one taken by Hong and Kung 
[Hong and Kung 1981) . They use the red-blue pebble game to obtain tight lower 
bounds on the 1/O-complexityof many algorithms, including classical 8(n 3 ) matrix 
multiplication, matrix-vector multiplication, and FFT The proof is obtained by show- 
ing that the size of any subset of the vertices of the CDAG is bounded by a function 
of the size of its dominator set (recall that a dominator set D for S is a set of vertices 
such that every path from an input vertex to a vertex in S contains some vertex in D). 

On the one hand, their dominator set technique has the advantage of allowing re- 
computation of any intermediate value. We were not able to allow recomputation using 
our edge expansion approach. On the other hand, the dominator set requires large in- 
put or output. Such an assumption is not needed by the edge expansion approach, as 
the bounds are guaranteed by edge expansi on of many (inter nal) parts of the CDAG. 
In that regard, one can view the a pproach of ]Irony et al. 2004 1 (also in [De mmel, Grig- 
ori, Hoemmen, and Langou, 2008; |Ballard et al. 2010UBallard et al. 2011cl ) as an edge 
expansion assertion on the CDAGs of the corresponding classical algorithms. 

The study of expansion prope rties of a CDAG was als o suggested as one of the main 
motivations of Lev and Valiant IILev and Valiant 198 31 in their work on superconcen- 
trators. They point out many papers proving that classes of algorithms computing DFT, 
matrix inversion and other problems all have to have CDAGs with good expansion 
properties, thus providing lower bounds on the number of the arithmetic operations 
required. 

Other papers study connections between bounded space computation, and combina- 



torial expansion-related properties of the corres ponding CDAG (see e.g., [ |Savage 1994 
Bilardi and Preparata 1999} IBilardi et al. 200 0] and references therein). 



1 .6. Paper organization 

Section [2] contains preliminaries on the notions of graph expansion. In Section [3] we 
state and prove the connection between I/O-complexityand the expansion properties 
of the computation graph. In Section |4] we analyze the expansion of the CDAG of 
Strassen's algorithm. We discuss the generalization of the bounds to other algorithms 
in Section [5j and present conclusions and open problems in Section [6) 
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2. PRELIMINARIES 

2.0.1. Edge expansion. The edge expansion h[G) of a d-regular undirected graph G = 

(V,E)is: ' 

hfn\- ■ \E(U,V\U)\ 

h(G) = mm — — (4) 

ucv,\u\<\v\/2 d ■ \U\ 

where E(A, B) = Eq(A, B) is the set of edges connecting the vertex sets A and B. We 
omit the subscript G when the context makes it clear. 

2.0.2. When G is not regular. Note that CDAGs are typically not regular. If a graph 
G = (V, E) is not regular but has a bounded maximal degree d, then we can add (< d) 
loops to vertices of degree < d, obtaining a regular graph G' . We use the convention 
that a loop adds 1 to the degree of a vertex. Note that for any S C V, we have \Eq(S, V\ 
S)\ = \Ec (S, V\S)\, as none of the added loops contributes to the edge expansion of G' . 

2.0.3. Expansion of small sets. For many graphs, small sets expand better than larger 
sets. Let h s (G) denote the edge expansion for sets of size at most s in G: 

. \E(U,V\U)\ ._. 
h s {G) = mm — — . (5) 

ucv,\u\<s d ■ \U\ 

In many cases, h s (G) does not depend on |V(G% although it may decrease when s 
increases. One way of bounding h s (G) is by decomposing G into small subgraphs of 
large edge expansion. 

CLAIM 2.1. Let G = (V, E) be a d-regular graph that can be decomposed into edge- 
disjoint (but not necessarily vertex- disjoint) copies of a d' -regular graph G' = (V',E'). 
Then the edge expansion of G for sets of size at most \V'\/2 is h(G') ■ ^, namely 

h ins- ■ \E G (U,V\U)\ ^ d' 

/i|v'i(G)= mm >h(G)- — . 

V- v ucv,\u\<\v\/2 d-\U\ d 

For proving this claim, recall the definition of graph decomposition: 

Definition 2.2 (Graph decomposition). We say that the set of graphs {G- = 
(Vi, Ei)}^] is an edge-disjoint decomposition of G = (V, E)ifV = U, Vt and E = 1+)^ E%. 

PROOF, (of Claim|2j} Let U C V be of size U < \V , \/2. Let {G' t = (V t , E t )} ie[l] be an 
edge-disjoint decomposition of G, where every Gi is isomorphic to G'. Then 

\E G (U,V\U)\ = Y^lEc^V^U^^Y^HGd-d'-m 

ie[l] ie[i] 

= h{G')-d' -Y^\ u <\ >h{G')-d! -\U\ . 
te[/] 

Therefore lEci ^ U)l > h(G') ■ . □ 

3. I/O-COMPLEXITY AND EDGE EXPANSION 

In this section we recall the notion of computation graph of an algorithm, then 
show how a partition argument connects the expansion properties of the computa- 
tion graph and the I/O-complexity of the algorithm. A s imilar partition argu ment al- 
ready appeared in | Irony et al. 2004 1, and then in our [Ballard et al. 201 lei In both 
cases it is used to relate 1/O-complexity to the Loomis-Whitney geometric bound 
I Loomis and Whitney 1949) , which can be viewed, in this context, as an expansion 
guarantee for the corresponding graphs. 
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3.1. The computation graph 

For a given algorithm, we consider the computation (directed) graph G = (V, E), where 
there is a vertex for each arithmetic operation (AO) performed, and for every input 
element. G contains a directed edge (u, v), if the output operand of the AO correspond- 
ing to u (or the input element corresponding to u), is an input operand to the AO 
corresponding to v. The in-degree of any vertex of G is, therefore, at most 2 (as the 
arithmetic operations are binary). The out-degree is, in general, unbounded^, i.e., it 
may be a function of |V|. We next show how an expansion analysis of this graph can be 
used to obtain the I/O-complexity lower bound for the corresponding algorithm. 

3.2. The partition argument 

Let M be the size of the fast memory. Let O be any total ordering of the vertices that 
respects the partial ordering of the CDAG G, i.e., all the edges are going up in the 
total order. This total ordering can be thought of as the actual order in which the 
computations are performed. Let P be any partition of V into segments Si,S 2 ,.--, so 
that a segment 5, e P is a subset of the vertices that are contiguous in the total 
ordering O. 

Let Rs and Ws be the set of read and write operands, respectively (see Figure []])■ 
Namely, Rs is the set of vertices outside S that have an edge going into S, and Ws 
is the set of vertices in S that have an edge going outside of S. Then the total I/O- 
complexity due to reads of AOs in S is at least \Rg\ - M, as at most AI of the needed 
operands are already in fast memory when the execution of the segment's AOs 
starts. Similarly, S causes at least \Ws\ — M actual write operations, as at most M of 
the operands needed by other segments are left in the fast memory when the execution 
of the segment's AOs ends. The total I/O-complexity is therefore bounded below bjf3 

10 > max (\R S \ + \W S \ - 2M) . (6) 
Sep 




9 As the lower bounds are derived for the bounded out-degree case, we will show how to convert the corre- 
sponding CDAG to obtain constant out-degree, without affecting the I/O-complexity too much. 
10 One can think of this as a game: the first player orders the vertices. The second player partitions them 
into contiguous segments. The objective of the first player (e.g., a good programmer) is to order the vertices 
so that any consecutive partitioning by the second player leads to a small communication count. 
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3.3. Edge expansion and l/O-complexity 

Consider a segment S and its read and write operands Rs and Ws (see Figure Q}. If 
the graph G containing S has h(G) edge expansion^, maximum degree d and at least 
2|S'| vertices, then (using the definition of h{G)), we have 

Claim 3.1. \R S \ + \W S \ > f ■ h(G) ■ \S\ . 

PROOF. We have \E(S, V \ S)\ > h{G) ■ d ■ \S\. Either (at least) half of the edges 
E(S, V\S) touch Rs or half of them touch Ws. As every vertex is of degree d, we have 

\Rs\ + \W S \ > max{\R s \, \W S \} > \ ■ \ ■ \E{S, V\S)\> h{G) ■ \S\/2. □ 

Combining this with (O and choosing to partition V into \V\/s segments of equal size 

s, we obtain: IO > max, ^ • (M|H _ 2A4) = Q, (\V\ ■ h(G)). In many cases h(G) is too 

small to attain the desired I/O-complexity lower bound. Typically, h(G) is a decreas- 
ing function in |V(G)|, namely the edge expansion deteriorates with the increase of 
the input size and with the running time of the corresponding algorithm. This is the 
case with matrix multiplication algorithms: the cubic, as well as the Strassen and 
"Strassen-like" algorithms. In such cases, it is better to consider the expansion of G on 

small sets only: IO > max s ^ ■ ( M ^ ),s - 2Af). Choosing^the minimal s so that 

> 3M (7) 

we obtain 

I VI 

IO > — ■ M . (8) 

s 

In some cases, the computation graph G does not fit this analysis: it may not be 
regular, it may have vertices of unbounded degree, or its edge expansion may be hard 
to analyze. In such cases, we may consider some subgraph G' of G instead to obtain a 
lower bound on the I/O-complexity : 

CLAIM 3.2. Let G = (V,E) be a computation graph of an algorithm Alg. Let G' = 
(V',E r ) be a subgraph of G, i.e., V' C V and E' C E. If G' is d-regular and a = 
then the I '/ ' O -complexity of Alg is 



IO > — • - — - • M (9) 

2 s 

where s is chosen so that > 3M . 

The correctness of this claim follows from Equations (0 and ©, and from the fact 
that at least an a/2 fraction of the segments have at least # • s of their vertices in G' 
(otherwise V < f • V/s ■ s + (1 - f ) • V/s • f s < aV). We therefore have: 

LEMMA 3.3. Let Alg be an algorithm with AO(N) arithmetic operations (N be- 
ing the total input size, N = 6(n 2 ) for matrix multiplication) and computation graph 



11 The direction of the edges does not matter much for the expansion-bandwidth argument: treating all edges 
as undirected changes the I/O-complexity estimate by a factor of 2 at most. For simplicity, we will treat G as 
undirected. 

12 The existence of a value s that satisfies the condition is not always guaranteed. In the next section we 
confirm this for Strassen, for sufficiently large |V(G)| (in particular, | V(G) \ has to be larger than M). Indeed 
this is the interesting case, as otherwise all computations can be performed inside the fast memory, with no 
communication, except for reading the input once. 
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G(N) = (V, E). Let G'(N) = (V , E') be a regular constant degree subgraph of G, with 
Jj^| = 9(1). Then the I / O-complexity of Altj^on a machine with fast memory of size M 
is 

IO = n(\V'\-h s (G'(N))) fors = AO(M). (10) 

As AO{N) = e(|V|) and h s (G'(N)) for s = AO(M) is Q(h(G'(M))) (recall Claim\2J$ 
we obtain, equivalently, 

IO = n {AO{N) ■ h{G'{M))) . (11) 
4. EXPANSION PROPERTIES OF STRASSEN'S ALGORITHM 

Recall Strassen's algorithm for matrix multiplication (see Algorithm [Jin Appendix lAl) 
and consider its computation graph (see Figure |2). Let Hi be computation graph of 
Strassen's algorithm for recursion of depth i, hence H\ gn corresponds to the computa- 
tion for input matrices of size n x n. H\ g n has the following structure: 

— Encode A: generate weighted sums of elements of A (this corresponds to the left 
factors of lines 5-11 of the algorithm). 

— Similarly encode B (this corresponds to the right factors of lines 5-11 of the algo- 
rithm). 

— Then multiply the encodings of A and B element- wise (this corresponds to line 2 of 
the algorithm). 

— Finally, decode C, by taking weighted sums of the products (this corresponds to lines 
12-15 of the algorithm). 

COMMENT 4.1. DeciC is presented, for simplicity, with vertices of in-degree larger 
than two (but constant). A vertex of degree larger than two, in fact, represents a full 
binary (not necessarily balanced) tree. Note that replacing these high in-degree vertices 
with trees changes the edge expansion of the graph by a constant factor at most (as this 
graph is of constant size, and connected). Moreover, there is no change in the numbe r of 
input and output vertices. Therefore the arguments in the following proof of Lemma l-Ol 
still hold. 

4.1. The computation graph for n-by-n matrices 

Assume w.l.o.g. that n is an integer power of 2. Denote by Enc\ g n A the part of H\ g n that 
corresponds to the encoding of matrix A. Similarly, Enc\ gn B, and Dec\ gn C correspond 
to the parts of H\ s „ that compute the encoding of B and the decoding of C, respectively. 

4. 1. 1. A top-down construction of the computation graph. We next construct the computation 
graph H i+ i by constructing Dec i+ iC (from DeciC and Dec\C) and similarly construct- 
ing Enci+iA and Enc i+ iB, then composing the three parts together. 

— Duplicate Dec\C T times. 

— Duplicate DeciC four times. 

— Identify the 4 • T output vertices of the copies of DeciC with the 4 • T input vertices 
of the copies of DeciC: 

— Recall that each DeciC has four output vertices. 

— The first output vertex of the T Dec\C graphs are identified with the T input 
vertices of the first copy of DeciC. 

— The second output vertex of the T DeciC graphs are identified with the T input 
vertices of the second copy of DeciC. And so on. 



13 In Strassen's algorithm, N = 2n 2 is the number of input matrices elements and T(N) = ©(n" ) = 
(7V"W2) qi i s t h e graph Dec k C for k = lgM, see Section|4]for the definition of Dec k C. 
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Fig. 2. The computation graph of Strassen's algorithm (See Algorithm[T]in Appendix). 
Top left: DeciC. Top right: H\, Bottom left: Dec\ sn C. Bottom right: H\ gn . 



— We make sure that the jth input vertex of a copy of DeciC is identified with an 
output vertex of the jth copy of DeciC 

— We similarly obtain Enc i+ iA from EnciA and Enc^A, 

— and Ena + \B from EnaB and Enc\B. 

— For every i, Hi is obtained by connecting edges from the jth output vertices of End A 
and Enc.iB to the jth input vertex of DeciC. 

This completes the construction. Let us note some properties of these graphs. 

The graph Dec\C has no vertices which are both input and o utpu t. As all out-degrees 
are at most 4 and all in degree are at most 2 (Recall Comment l4.1D we have: 

FACT 4.2. All vertices of Dec\ gn C are of degree at most 6. 

However, Enc±A and Enc\B have vertices which are both input and output (e.g., 
An), therefore Enc\ gn A and Enc\ gn B have vertices of out-degree 0(lg n). All in-degrees 
are at most 2, as an arithmetic operation has at most two inputs. 

As H\ s „ contains vertices of large degrees, it is easier to consider Deci gn C: it contains 
only vertices of constant bounded degree, yet at least one third of the vertices of H\ g n 
are in it. 

LEMMA 4.3. (MAIN LEMMA) The edge expansion of Dec k C is 
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The proof follows below, but first note that it suffices to deduce the expansion of 
Deci gn C on small sets. Assume wl.o.g. that n is an integer power of VA/I 14 ! Then 
Deci gn C can be split into edge-disjoint copies of Deci lgM C. Using Claim [27X1 we thus 

have: 

COROLLARY 4.4. s ■ h s {Decy gn C) > MI for s = 9 • ApsV 2 . 

As Deci„ n C contains a = i of the vertices of H\ gn , Lemma [3.31 now yields Main Theo- 
rem ll.lT Note that Dec\ gn C has no input vertices, so no restriction on input replication 
is needed. 

4. 1.2. Combinatorial Estimation of the Expansion 

Proof of LemmaOOJ Let G k = (V,E) be Dec k C, and let S c V, \S\ < \V\/2. We 

next show that \E(S, V \ S)\ > c ■ d ■ \S\ ■ (f ) , where c is some universal constant, and 
d is the constant degree of Dec k C (after adding loops to make it regular). 

The proof works as follows. Recall that G k is a layered graph (with layers corre- 
sponding to recursion steps), so all edges (ex cluding loops) connect between consecu- 
tive levels of vertices. We argue (in Claim \4M that each level of G k contains about the 
sam e fra ction of S vertices, or else we have many edges leaving S. We also observe (in 
Fact l4.9D that such homogeneity (of a fraction of S vertices) does not hold between dis- 
tinct parts of the lowest level, or, again, we have many edges leaving S. We then show 
that the homogeneity between levels, combined with the heterogeneity of the lowest 
level, guarantees that there are many edges leaving S. 

Let h be the ith level of vertices of G k , so A k = \h\< \h\ < ■■■ < \h\ = A k - l+1 7 1 - 1 < 
■ ■ ■ < \h+i | = 7 k . Let Si = S n h. Let a = j^i be the fractional size of S and o-j = Jj|4 be 
the fractional size of S at level i. Due to averaging, we observe the following: 

FACT 4.5. There exist i and i' such that Oi< a < <ji<. 

Fact 4.6. 



fc+i fe+i 

\y\ = 5>i=£ifc+i 




on 2 < l' fc + 1 l < 3 . 1 nnrl 3 

bu 7 - |V| - 7 W4\M-2> aau ? - y 7) - | V | - 7 - y 7 j - 1 _^y+2- 

CLAIM 4.7. There exists d = c'(Gi) so that \E(S,V\S)nE(U,l i+ i)\ > d-d- \8 t \ ■ \U\. 

PROOF. Let G' be a G\ component connecting k with (so it has four vertices in 
li and seven in G' has no edges in E(S, V \ S) if all or none of its vertices are in 
S. Otherwise, as G' is connected, it contributes at least one edge to E(S, V \ S). The 
number of such G\ components with all their vertices in S is at most min{ai, a i+ i} ■ 4p- 



14 We may assume this, as we are dealing with a lower bound here, so it suffices to prove the assertion for 
an infinite number of n's. Alternatively, in the following decomposition argument, we leave out a few of the 
top or bottom levels of vertices of Dec lgH C, so that n is an integer power of \/~Kl and so that at most \S\/2 
vertices of S are cut off. 
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Therefore, there are at least \<r l - cr i+1 | ■ ^ G± components with at least one vertex in 
S and one vertex that is not. □ 

Claim 4.8 (Homogeneity between levels). If there exists i so that l t7 ~ g< l > 
then 

\E(S,V\S)\ >c-d-\S\ 

where c > is some constant depending on G\ only. 

Proof. Assume that there exists j so that \ a ~° j \ > Let Si = <r i+1 — By Claim 
I4.7L we have 

\E(S,V\S)\ > Y,\E(S,V\S)nE(k,l i+1 )\ 

ie [k] 

> Y^<t-d-\8i\-\U\ 

i£[k] 

> c'-d-\h\ 

ie[k] 



> c' • d ■ I ii| • I max er,; — min cr, 
\ie[fe+i] ie[k+x] 

By the initial assumption, there exists j so that > jft, therefore maxjcr 



mini tJi > jq , then 



\E(S,V\S)\>c'-d-\h\-- 



ByFactiH |^| > f • (|) fc • \V\, 



>c'-d- 



7 V 7 



\V\ 



10 



As|5|=tr.m 



>c-d. I- 
for anyc < fg • f. □ 

Let T k be a tree corresponding to the recursive construction of Gk in the following 
way (see Figure [3]): T k is a tree of height k + 1, where each internal node has four 
children. The root r of T k corresponds to l k +i (the largest level of G k )- The four children 
of r correspond to the largest levels of the four graphs that one can obtain by removing 
the level of vertices l k +i from G k - And so on. For every node u of T k , denote by V u the 
set of vertices in G k corresponding to u. We thus have = 7 fc where r is the root of 
Tk, \ V U \ = 7 k ~ x for each node u that is a child of r; and in general we have 4' tree nodes 
u corresponding to a set of size \V U \ = 7 fe ^ I+1 . Each leaf / corresponds to a set of size 1. 

For a tree node u, let us define p u = to be the fraction of S nodes in V u , and 

5 U = \Pu~ P P (u) I > where p(u) is the parent of u (for the root r we let p(r) = r). We let U be 
the ith level of T k , counting from the bottom, so t k+1 is the root and ti are the leaves. 



h h 



h+i h+i 

Fig. 3. The graph Gk and its corresponding tree Tk. 

FACT 4.9. As V r = h+i we have p r = a k +i. For a tree leaf u e t±, we have \V U \ = 1. 
Therefore p u e {0, 1}. The number of vertices u in t\ with p u = 1 is a\ ■ 

CLAIM 4.10. Let u be an internal tree node, and let u\, u 2 , u 3 , u 4 be its four children. 
Then 

\E(S, V\S)n E(V Ui ,V Uo )\>c"-d-J2 K - Pu \ ■ IK, | 

i i 

where c" = c"{G 1 ). 

Proof. The proof follows that of Claim 14771 Let G be a G\ component connecting 
V UQ with U(£[4] (so it has seven vertices in V Uo and one in each of V Ul ,V U2 ,V U3 ,V Ui ). 
G' has no edges in E(S, V\ S) if all or none of its vertices are in S. Otherwise, as G' is 
connected, it contributes at least one edge to E(S, V\S). The number of G\ components 

with all their vertices in S is at most mvn{p Uo ,p Ul ,p U2 , p U3 , p Ui } ■ . Therefore, there 

are at least max ie[4 ]{|p Uo - p Ui \} ■ > ^ • J2ie{4] ~ /°«ol ' IKJ Gi components 
with at least one vertex in S and one vertex that is not. □ 



We have 

\E(S,V\S)\ = \E(S,V\S)nE(V u ,V p(u) )\ 
uer k 



By Claim gTlOl this is 

> ^2 c" ■ d- \p u - p p[u) \ ■ \V U \ 

uGT k 

= c" ■ d ■ \P U ~ PpM \ ' 7?_1 

ie[k] ueu 

ie[fc] «eti 
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As each internal node has four children, this is 

= c" ■ d ■ \P U ~ Pp( u )\ 



where v ~ r is the path from v to the root r. By the triangle inequality for the func- 
tion | ■ | 

>c"-d-^2 \Pu - Pr\ 

vet! 



By FactHSl 



By Claim [47H1 w.l.o.g., \<r k+1 



and by Fact|4Ul 



for any c < | ■ c". 



This completes the proof of Lemma l4.31 
5. OTHER ALGORITHMS 

We now discuss the applicability of our approach to other algorithms, starting with 
other fast matrix multiplication algorithms. 

5.1. "Strassen-like" Algorithms 

A "Strassen-like" algorithm has a recursive structure that utilizes a base case: multi- 
plying two 7i -by-rio matrices using m(n ) multiplications. Given two matrices of size 
n-by-n, it splits them into n§ blocks (each of size ^-by-^), and works blockwise, ac- 
cording to the base case algorithm. Additions (and subtractions) in the base case are in- 
terpreted as additions (and subtractions) of blocks. These are performed element- wise. 
Multiplications in the base case are interpreted as multiplications of blocks. These are 
performed by recursively calling the algorithm. The arithmetic count of the algorithm 

is then T(n) = m(n ) ■ T (jA + 0{n 2 ), so T(n) = 8(n w °) where w = log„ m(n ). 

This is the structure of a ll the fast matrix multiplication algorithms that 
were obtained since Strassen's llPan 1980I IBini 1980I |Schonhage 198ll |Rpmani 1982I 
Coppersmith and Winograd 1982[ IStrassen 1987t |Coppersmitn and Winograd 1987| , 
(see ||Burgisser et al. 1997| For discussion of these algorithms), as well as 
HCohn et al. 2005II . where the base case utilizes a novel group-theoretic approach. 
In fact, any fast matrix multiplication algorithm can be converted into this form 



>J'-d-\h\-((l-(Ti)-p r + tri-(l-pr)) 

< jo and \a 1 - o\jo < i. As p r = a k+1 , 

3 



> --c" -d-\k\-a 



>c-d-\S\ 
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IRaz 2003H . and can even be made numerically stable while preserving this form [Dem- 
mel, Dumitriu, Holtz, and Kleinberg, 2007]. 

5.1.1. A critical technical assumption. For our technique to work, we further demand 
that the DeciC part of the computation graph is a conne cted graph, in order to be 
"Strassen-like" (this was assumed in the proof of Claim I4.7D. Thus the "St rassen- 
like" class includes Winograd's variant of Strassen's algorithm [Winograd 1971 1, which 



uses 15 additions rather than 18, but not the cubic algorithm, where Dec\C is com 
posed of four disconnected graphs (corresponding to the four outputs). We conjecture 
that Dec\C is indeed connected for all existing fast matrix-multiplication algorithms. 
We note that the dem and of connectivity of DeciC may be waved in some cases (see 
Ballard et al. 2011el ). 

5.1.2. The communication costs of "Strassen-like" algorithms. To prove Theorem I1.3I which 
generalizes the I/O-complexity lower bound of Strassen's algorithm (Theorem ll.il> t o all 
"Strassen-like" algorithms, we note the following: The entire proof ofTheorem ll.il and 
in particular, the computations in the proof of Lemma I4.31 hold for any "Strassen-like" 
algorithm, where we plug in nQ,m(n ), and m "° ) instead of 4, 7, and |. For bounding 

the asymptotic I/O-complexity , we do not care about the number of internal vertices of 
DeciC; we need only to know that DeciC is connected (this critical technical assump- 
tion is used in the proof of Claim l-OK and to know the sizes n and m(n ). The only 
nontrivial adjustment is to show the equivalent of Fact l4~2l that the graph Dec\ ogn C is 
of bounded degree. 

CLAIM 5.1. The Deci ogn C graph of any "Strassen-like" algorithm is of degree 
bounded by a constant. 

Proof. If the set of input vertices of Dec\C, and the set of its output vertices are 
disjoint, then Deci ogn C is of constant bounded degree (its maximal degree is at most 
twice the largest degree of Dec\C). 

Assume (towards contradiction) that the base graph DeciC has an input vertex 
which is also an output vertex. An output vertex represents the inner product of two 
no-long vectors, i.e., the corresponding row-vector of A and column vector of B. The 
corresponding bilinear polynomial is irreducible. This is a contradiction, since an in- 
put vertex represents the multiplication of a (weighted) sum of elements of A with a 
(weighted) sum of elements of B. □ 

5.2. Uniform, Non-stationary Fast Matrix Multiplication Algorithms 

Another class of matrix multiplication algorithms, the uniform, non- stationary algo- 
rithms, allows mixing of schemes of the previous ("Strassen-like") class. In each recur- 
sive level, a different scheme may be used. The CDAG has a repeating structure inside 
one level, but the structure may differ between two distinct levels. This class includes, 
for example, algorithms that optimize for input sizes (for sizes that are not an integer 
power of a constant integer). The class also includes algorithms that cut the recursion 
off at some point, and th en switch to the classical algorithm. For these an d other im- 
plementation issues, see [ Douglas et al. 1994} IHuss-Lederman et al. 19961 (sequential 



model) and jDesprez and Suter 2004| (parallel model). The I/ O-complexity lower b ound 
generalizes to this class, and will appear in a separate note BBallard et al. 20l"lel . 

5.3. Non-uniform, Non-stationary Fast Matrix Multiplication Algorithms 

A third class, the non-uniform, non-stationary algorithms, allows recursive calls to 
have different structure, even when they refer to multiplication of matrices in the 
same recursive level. It is not clear how to analyze the expansion of the CDAG of an 
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algorithm in the third class, although we are not aware of any a lgorithms in this class. 
Such an analysis, applied to the base case of llCohn et al. 20051 . may improve the I/O- 
complexity lower bound for fast matrix multiplication by a (large) constant. 

5.4. Multiplying Rectangular Matrices 

Multiplication of rectangular matrices have see n a series of increa singly fast algo- 



rithms culminating in Coppersmith's algorithm [Coppersmith 1997) . It is possible to 
extend our approach and obtain the first lower bounds on the communication costs 
for these algorithms, and sho w that in some cases they are attainable, and therefore 
optimal DBallard et al. 2011dl . 

5.5. Other Algorithms 

Fast matrix multiplication algorithms are basic building blocks in many fast algo- 
rithms in linear algebra, such as algorithms for LU, QR, and solving the Sylvester 
equation [Demmel, Dumitriu, and Holtz, 2007]. Therefore, I/O-complexity lower 
bounds for these algorith ms can be derived fr om our lower bounds for fast matrix 
multiplication algorithms I B allard et al. 2011all . For example, a lower bound on LU (or 
QR, etc.) follows when the fast matrix multiplication algorithm is called by the LU al- 
gorithm on sufficiently large subblocks of the matrix. This is the case in the algorithms 
of [Demmel, D umitriu, and Holtz, 2 007], and we can then deduce matching lower and 
upper bounds HBallard et al. 2011ai 

6. CONCLUSIONS AND OPEN PROBLEMS 

We obtained a tight lower bound for the I/O-complexity of Strassen's and "Strassen- 
like" fast matrix multiplication algorithms. These bounds are optimal for the sequen- 
tial model with two memory levels and with memory hierarchy. The lower bounds 
extend to the parallel model and other models. Recently these bounds were attained 
(up to an O(logp) factor) by new parallel implementa tions, for Strassen's algorithm 
and for "Strassen-like" algorithms HBallard et al. 20lTTl . 

6.1. Memory constraints for the classical and "Strassen-like" matrix multiplication 
algorithms 

Some (parallel) algorithms require very little, up to a constant factor extra memory 
beyond what is necessary to keep the input and output. These are sometimes called 
linear space algorithms. One class of such algorithms are the "2D" algorithms for clas- 
sical matrix multiplication, that use two-dimensional grid of processors. Here we al- 
low M = 6 hf\ local memory use (recall that p is the number of processors, and n 

the dimension of the matrices), thus no replication of the input matrices is allowed 
llCannon 19691 

If the underlying grid of p processors is a three-dimensional mesh, and the 
available memory per processor is larger by a factor of p* than the min- 
imum necessary to st ore the input and output matrices, then a "3D" algo- 
rithm can be used (see UDekel et al. 198ll Aggarwal et al. 1990[ Agarwal et al. 1995 



IMcColl and Tiskin 199911 ). These "3D" algorithms can reduce the communication cost 
by a factor of p 1 / 6 , down to 8 ( ^ J [Aggarwal et al. 1990| , attaining the lower bounds 



[Irony et al. 2004 IBallard et al. 2011cl that take into account any amount of replica- 
tion. 

Recently, Demmel and Solomonik BSolomonik and Demmel 20111 showed how to 
combine these two extremes into one algorithm (named "2.5D") and obtained a commu- 
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nication efficient implementation for classical matrix multiplication, for local memory 
size M = e(c-^j for any 1 < c < pi See TableEU 

Using Corollaries 11.21 and 1 1.41 and plugging in M = (^y ), M = 6 (^j , and 

M = 6 • we obtain corresponding lower bounds for "Strassen-like" algorithm 

with various restriction local memory sizes. These were recently at tained by a new par- 
allel imple men tation for Strassen and "Strassen-like" algorithms MBallard et al. 20TT|| 
(see Table \6A) . Interestingly, the numerators here do not depend on w . Thus, an im- 
provement of ljq (the exponent of the arithmetic cost of the algorithm) affects only the 
power of p in the denominator. 





Classical algorithms 




'Strassen-like 


" algorithms 








2 < 


u)o < 3 


Lo 


wer bound 


Attained by 


Lower bound 


Attained by 


Parallel 

M = 

i 

1 < C < p3 


I Irony et al. 20041 

"((*)"•*) 




n ( 


[here] 


P J 












ICannon 19691 


n 






[Ballard, 

Demmel, 

Holtz, 

Rom, 

Schwartz, 

2011] 


n 






IDekel et al. 19811 

JAggarwal et al. 1990) 


n 












[Solomonik and 
Demmel 2011] 


n | 








i i 





Table I. 



6.2. Recursive Implementations 

In some cases, the simplest recursive implementation of an algorith m turns out to be 
communication-optimal (e.g., in the cases of matrix multiplication IFrigo et al. 1999| 



and Cholesky decomposition [Ahmed an d Ping ali 2000tlBallard et al. 20101 . but not in 
the case of LU decomposition UToledo 19971 . which is bandwidth optimal but not la- 
tency optimal). This leads to the question: when is the communication-optimality of an 
algorithm determined by the expansion properties of the corresponding computation 
graphs? In this work we showed that such is the case for "Strassen-like" fast matrix 
multiplication algorithms. 

6.3. Other Hardware 

It is of great interest to construct new models general enough to capture the rich and 
evolving design space of current and predicted future computers. Such models can be 
homogeneous, consisting of many layers, where the components of each layer are the 
same (e.g., a supercomputer with many identical multi-core chips on a board, many 
identical boards in a rack, many identical racks, and many identical levels of associ- 
ated memory hierarchy); or heterogeneous, with components with different properties 
residing on the same level (e.g., CPUs alongside GPUs, where the latter can do some 
computations very quickly, but are much slower to communicate with). 



Graph Expansion and Communication Costs of Fast Matrix Multiplication 



19 



Some experien ce has be en acquired with such sys tems (see the MAGMA project 
BBaboulin et al. I . and also llVolkov and Demmel 20081 for using GPU assisted linear 
algebra computation ). A first step in analy zing such systems h as been recently intro- 
duced by Ballard, Demmel, and Gearhart BBallard et al. 20TT]| . where they modeled 
heterogenous shared memory architectures, such as mixed GPU/CPU architecture, 
and obtained tight lower and upper bounds for 0(n 3 ) matrix multiplication. 

Note that we can similarly generalize Corollaries 11.21 and 11.41 to other models, such 
as the heterogenous model and shared memory model. The reduction is achieved by 
observing the communication of a single processor. 

However, there is currently no systematic theoretic way of obtaining upper and lower 
bounds for arbitrary hardware models. Expanding such results to other architectures 
and algorithmic techniques is a challenging goal. For example, recursive algorithms 
tend to be cache oblivious and communication optimal for the sequential hierarchy 
model. Finding an equivalent technique that would work for an arbitrary architecture 
is a fundamental open problem. 
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A. STRASSEN'S FAST MATRIX MULTIPLICATION ALGORITHM 

Strassen's original algorithm follows MStrassen 196911 . See [[Winograd 19711 for Wino- 
grad's variant, which reduces the number of additions. 

Algorithm 1 Matrix Multiplication: Strassen's Algorithm 

Input: Two n x n matrices, A and B. 
l: if n = 1 then 

2: Cii = An ■ Bu 

3: else 

4: {Decompose A into four equal square blocks A = ^ n ^j 12 

and the same for B.} 

5: Mi = (An + A 22 ) ■ (B n + B 22 ) 

6: M 2 = (A 21 + A 22 ) ■ Bu 

7: M 3 = A n ■ (B 12 - B 22 ) 

8: M 4 = A 22 ■ (B 21 - Bu) 

9: M 5 = (An +A 12 ) ■ B 22 

10: M 6 = (A 21 - An) ■ (B n + B 12 ) 

ll: M 7 = (A 12 - A 22 ) ■ (B 21 + B 22 ) 

12: Cu = Mi + M 4 - M 5 + M 7 

13: C*i2 = M 3 + M 5 
14: C 2 i = M 2 + M 4 

15: C* 22 = Mi - Af 2 + M 3 + M 6 
16: end if 
17: return C 



